The core of tokenizers, written in Rust.
Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.
What is a Tokenizer
A Tokenizer works as a pipeline: it processes some raw text as input and outputs an Encoding.
The various steps of the pipeline are:
- The Normalizer: in charge of normalizing the text. Common examples of normalization are the Unicode normalization standards, such as NFD or NFKC.
- The PreTokenizer: in charge of creating initial word splits in the text. The most common way of splitting text is simply on whitespace.
- The Model: in charge of doing the actual tokenization. An example of a Model would be BPE or WordPiece.
- The PostProcessor: in charge of post-processing the Encoding to add anything relevant that, for example, a language model would need, such as special tokens.
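Taken together, the four steps above can be sketched as a chain of plain functions. This is a toy illustration only, not the crate's API: every name below is a stand-in, the Normalizer is reduced to lowercasing, and the Model to a vocabulary lookup instead of a real BPE or WordPiece.

```rust
use std::collections::HashMap;

// Normalizer stand-in: lowercase the text (a real one might apply NFD or NFKC).
fn normalize(text: &str) -> String {
    text.to_lowercase()
}

// PreTokenizer stand-in: create initial word splits on whitespace.
fn pre_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_owned).collect()
}

// Model stand-in: look each word up in a tiny vocabulary, mapping unknown
// words to "[UNK]" (BPE would instead merge characters into subwords).
fn model(words: Vec<String>, vocab: &HashMap<&str, u32>) -> Vec<String> {
    words
        .into_iter()
        .map(|w| {
            if vocab.contains_key(w.as_str()) {
                w
            } else {
                "[UNK]".to_owned()
            }
        })
        .collect()
}

// PostProcessor stand-in: wrap the encoding in special tokens.
fn post_process(mut tokens: Vec<String>) -> Vec<String> {
    tokens.insert(0, "[CLS]".to_owned());
    tokens.push("[SEP]".to_owned());
    tokens
}

fn main() {
    let vocab: HashMap<&str, u32> = [("hello", 0), ("world", 1)].into_iter().collect();
    let tokens = post_process(model(pre_tokenize(&normalize("Hello Büro")), &vocab));
    println!("{:?}", tokens); // ["[CLS]", "hello", "[UNK]", "[SEP]"]
}
```

Each stage consumes the previous stage's output, which is why the real pipeline can swap any component (a different Normalizer, a different Model) without touching the others.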
Quick example
```rust
use tokenizers::tokenizer::{Result, Tokenizer};
use tokenizers::models::bpe::BPE;

fn main() -> Result<()> {
    let bpe_builder = BPE::from_file("./path/to/vocab.json", "./path/to/merges.txt");
    let bpe = bpe_builder
        .dropout(0.1)
        .unk_token("[UNK]".into())
        .build()?;

    let mut tokenizer = Tokenizer::new(bpe);

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}
```
Additional information
- tokenizers is designed to leverage CPU parallelism when possible. The level of parallelism is determined by the total number of cores/threads your CPU provides, but this can be tuned by setting the RAYON_RS_NUM_CPUS environment variable. As an example, setting RAYON_RS_NUM_CPUS=4 will allocate a maximum of 4 threads. Please note this behavior may evolve in the future.
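One assumed way to apply the cap for a single invocation is to set the variable inline on the command line; here `printenv` merely stands in for your own tokenizers-based binary, since the variable only needs to be present in that process's environment:

```shell
# Limit the Rayon thread pool to 4 threads for this one run.
# Replace `printenv RAYON_RS_NUM_CPUS` with your actual binary.
RAYON_RS_NUM_CPUS=4 printenv RAYON_RS_NUM_CPUS
```

Setting the variable inline avoids changing the parallelism of every other process in your shell session.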